Regular Expressions Primer

Introduction

Many people use different names to refer to Regular Expressions, such as Regexp, Regex, ASCII puke(!).

Regex are a way to express a set of strings. They can be used to 'scrub' website to extract links, informations, titles.. They are just simply a series of codes that you use to define exactly what text you are looking for. And text can be numbers letters punctuation etc.

This code is used to match patterns. lets see an example. I want to find all numbers, one to five digit long followed by a space and the words "Hello World".

For example: 12345 Hello World. would be defined by \d{1,5}\s\w+\s\w+\.

We define the pattern with the following syntax. The slash is delimiting the start of a regular expression.

\d{1,5}

So this means: A digit between 1 to 5 digits in length, whenever you use these curly brackets, you are saying that you expect between n to m digits

Then I see in our example, the numbers are followed by a space. The Regular expression for a space is

\s

After this you expect a series of character which will be from one to multiple letters in length. This w represents anything which is a character, and the + means one or more.

\w+

again followed by space and a word

\s\w+

then the point. The point can be interpreted in regex to mean any character except a newline. So I need to put a backslash.

\.

We now put all this together:

\d{1,5}\s\w+\s\w+\.

Nice isn't it?

A bit of theory

These are the common used parameters:

\d is any number (digit)

\D with the capital D is anything but a number (digit)

\s A space

\S Anything but a space

\w Any character

\W Anything but a character

. matches any character except a line break

\b match for a space which precedes or follows a whole word

? zero or one repetition of the code which precedes it

* One or more repetition

{n} If you know exactly how many repetitions, insert the amount in curly brackets

{m,n} Here means m to n repetitions as seen above

\e escaped whitespace

\f form feed

\n newline

\r carriage return

\t horizontal tab

Using curly brackets to find commonly mispelled words. For example, tomorrow can be mispelled as tommorow, tommorrow..

We can find them in this way:

\tomm{0,1}orr{0,1}ow

[a-z] any lower case letter

[A-Z] any upper case letter

[0-9] any number

[a-cC-R3-4] any letter from a to c or uppercase letter from C to R or number from 3-4

An example

Lets look for the name Catherine in our text. But we want to find her last name as well.

We will use the OR operator | because we want to find the name but also the shorter verions of her name such of Cath or Cathy, followed by a trailing space and another word of one of more characters:

\Cat(herine|h)\b\w+\b

This is the same as

\(Catherine|Cath)\b\w+\b

if I add \1 at the end, it means take what it is in the parentheses and redo it again

\(Catherine|Cath)\b\w+\b\1 same as \(Catherine|Cath)\b\w+\b\(Catherine|Cath)

The caret ^ means at the beginning of the text. So to find all sentences starting with Dog

(^Dog\s.+$)

This will find all lines starting with dog followed by a space and any or more chars (but no newline) and ending ($).

Regex with Python

So now we apply what we have learned to use it in python



# import our regex library
import re
# lets take a random text
# if we have a file we can use open('filename.txt')
text = "aAsdfgh$%#n Cat iuhi"
# define our regular expression in the function compile
expression = re.compile('Cat')
# tell python to look for it
pattern = re.search(expression, text)
# print the result using the group function
print(pattern.group())

# To find all occurrences there is the method findall
# lets print all the letters and only the letters in our text:

strToSearch = ""

for line in text:
    strToSearch += text
    
print(strToSearch)    

expression = re.compile('[a-z]',re.IGNORECASE)
pattern2 = re.findall(expression,text)

for i in pattern2:
    print(i,end='')

Example from fastai DL course v3

Lets see this example as seen on fastai deep learning for coders version 3. We have a set of image files in a folder with names like

'data/oxford-iiit-pet/images/americanbulldog146.jpg'

The label for that particular picture is in the name of the file. Lets see how we can extract it with a regex pattern:

/([^/]+)_\d+.jpg$

/ starts the regex

() in the parentheses is what we are looking for, a greoup of characters defined as:

[^/]+ the ^ is a negation, so it says no forward slash and + arbitrarily long so ([^/]+) is looking for a group of characters that do not contain forward slashes and are arbitrarily long.

_ it is literally expecting an underscore character here

The RegEx \d refers to numerical digits and the plus + sign that comes after it means that there may be arbitrarily many digits. This looks for the numerical ID of the images.

.jpg$ The dollar sign at the end tells that we are matching only files which end with .jpg

The python code to extract the name would be:

import re

string = 'data/oxford-iiit-pet/images/american_bulldog_146.jpg'

pat = r'([^/]+)_\d+.jpg$'

pat = re.compile(pat)

print(pat.search(string).group(1))

# we get this result and not the whole string, 
# because on the previous expression, we requested 'group 1' which is 
# that part of the regex with the parentheses
>american_bulldog

Regex in HTML5

Regex can be used in HTML5 for form validation with the pattern attribute.

<input name="zip" pattern="\d{5}" /> 

In the command line

cd to the folder where you are looking for a specific file. Suppose you are looking for all files like with a name like aaa_431x303.jpg We can use find with -regex and the pattern to fild them:

find -E . -regex '.*[0-9]{2,4}x[0-9]{2,4}\.jpg'

Resources